Beijing, China - May 24-26, 2016

Outline

  • Session 1: Motivation, why and how to think about data, and getting started with R
  • Session 2: Making basic plots, grammar of graphics, good practices
  • Session 3: Advanced graphics, layering, using maps

(If you re-started RStudio, be sure to re-open your project too.)

Using the Packages tidyr, dplyr

  • Writing readable code using pipes
  • What is tidy data? Why do you want tidy data? Getting your data into tidy form using tidyr.
  • Summarise, mutate, filter, select, arrange with dplyr
  • Reading different data formats
  • String operations, working with text
  • Re-structuring time variables
  • Computing on lists with purrr (not covered)
  • Handling missing values (not covered)
## Pipes
%>%

Pipes allow the code to be read like a sequence of operations

student2012.sub <- readRDS("../data/student_sub.rds")
# table(student2012.sub$CNT)
student2012.sub %>% group_by(CNT) %>% tally()
#> Source: local data frame [43 x 2]
#> 
#>      CNT     n
#>    (chr) (int)
#> 1    ARE 11500
#> 2    AUS 14481
#> 3    AUT  4755
#> 4    BEL  8597
#> 5    BGR  5282
#> 6    BRA  5506
#> 7    CAN 21544
#> 8    CHL  6856
#> 9    COL  9073
#> 10   CZE  5327
#> ..   ...   ...

Example Data 1

What are the variables?

Inst AvNumPubs AvNumCits PctCompletion
ARIZONA STATE UNIVERSITY 0.90 1.57 32
AUBURN UNIVERSITY 0.79 0.64 44
BOSTON COLLEGE 0.51 1.03 47
BOSTON UNIVERSITY 0.49 2.66 34
BRANDEIS UNIVERSITY 0.30 3.03 49
BROWN UNIVERSITY 0.84 2.31 55

Example Data 2

What's in the column names of this data? What are the experimental units? What are the measured variables?

id WI-6.R1 WI-6.R2 WI-6.R4 WM-6.R1 WM-6.R2 WI-12.R1 WI-12.R2 WI-12.R4 WM-12.R1 WM-12.R2 WM-12.R4
Gene 1 2.2 2.20 4.2 2.63 5.1 4.5 5.5 4.4 3.9 4.2 3.73
Gene 2 1.5 0.59 1.9 0.52 2.9 1.4 3.0 1.3 1.2 1.2 0.89
Gene 3 2.0 0.87 3.3 0.53 4.6 2.2 5.6 2.5 2.5 3.0 1.35

Example Data 3

What are the variables? What are the records?

melbtemp <- read.fwf("../data/ASN00086282.dly", 
   c(11, 4, 2, 4, rep(c(5, 1, 1, 1), 31)), fill=T)
kable(head(melbtemp[,c(1,2,3,4,seq(5,128,4))]))
V1 V2 V3 V4 V5 V9 V13 V17 V21 V25 V29 V33 V37 V41 V45 V49 V53 V57 V61 V65 V69 V73 V77 V81 V85 V89 V93 V97 V101 V105 V109 V113 V117 V121 V125
ASN00086282 1970 7 TMAX 141 124 113 123 148 149 139 153 123 108 119 112 126 112 115 133 134 126 104 143 141 134 117 142 158 149 133 143 150 145 115
ASN00086282 1970 7 TMIN 80 63 36 57 69 47 84 78 49 42 48 56 51 36 44 39 40 58 15 33 51 74 39 66 78 36 61 46 42 63 39
ASN00086282 1970 7 PRCP 3 30 0 0 36 3 0 0 10 23 3 0 5 0 0 0 0 0 8 0 18 0 0 0 0 13 3 0 25 0 3
ASN00086282 1970 8 TMAX 145 128 150 122 109 112 116 142 166 127 117 127 159 143 114 65 113 125 129 147 161 168 178 161 145 142 137 150 120 114 129
ASN00086282 1970 8 TMIN 50 61 75 67 41 51 48 -7 56 62 47 33 67 84 11 41 18 50 22 28 74 94 73 88 50 48 54 78 47 18 39
ASN00086282 1970 8 PRCP 0 66 0 53 13 3 8 0 0 0 3 5 0 0 64 3 99 36 8 0 0 0 8 36 25 30 56 5 69 3 20

Example Data 4

What are the variables? What are the experimental units?

tb <- read_csv("../data/tb.csv")
#tail(tb)
colnames(tb)
#>  [1] "iso2"   "year"   "m_04"   "m_514"  "m_014"  "m_1524" "m_2534"
#>  [8] "m_3544" "m_4554" "m_5564" "m_65"   "m_u"    "f_04"   "f_514" 
#> [15] "f_014"  "f_1524" "f_2534" "f_3544" "f_4554" "f_5564" "f_65"  
#> [22] "f_u"

Example Data 5

What are the variables? What are the experimental units?

pew <- read.delim(
  file = "http://stat405.had.co.nz/data/pew.txt",
  header = TRUE,
  stringsAsFactors = FALSE,
  check.names = F
)
kable(pew[1:5, 1:5])
religion <$10k $10-20k $20-30k $30-40k
Agnostic 27 34 60 81
Atheist 12 27 37 52
Buddhist 27 21 30 34
Catholic 418 617 732 670
Don’t know/refused 15 14 15 11

Example Data 6

10 week sensory experiment, 12 individuals assessed taste of french fries on several scales (how potato-y, buttery, grassy, rancid, paint-y do they taste?), fried in one of 3 different oils, replicated twice. First few rows:

time treatment subject rep potato buttery grassy rancid painty
1 1 3 1 2.9 0.0 0.0 0.0 5.5
1 1 3 2 14.0 0.0 0.0 1.1 0.0
1 1 10 1 11.0 6.4 0.0 0.0 0.0
1 1 10 2 9.9 5.9 2.9 2.2 0.0

What is the experimental unit? What are the factors of the experiment? What was measured? What do you want to know?

Messy Data Patterns

There are various features of messy data that one can observe in practice. Here are some of the more commonly observed patterns.

  • Column headers are values, not variable names
  • Variables are stored in both rows and columns, contingency table format
  • One type of experimental unit stored in multiple tables
  • Dates in many different formats

What is Tidy Data?

  • Each observation forms a row
  • Each variable forms a column
  • Contained in a single table
  • Long form makes it easier to reshape in many different ways
  • Wide form is common for analysis

Tidy Data

Messy Data

Tidy Verbs

  • gather: specify the keys (identifiers) and the values (measures) to make long form (used to be called melting)
  • spread: variables in columns (used to be called casting)
  • nest/unnest: working with lists
  • separate/unite: split and combine columns

French Fries

10 week sensory experiment, 12 individuals assessed taste of french fries on several scales (how potato-y, buttery, grassy, rancid, paint-y do they taste?), fried in one of 3 different oils, replicated twice. First few rows:

time treatment subject rep potato buttery grassy rancid painty
1 1 3 1 2.9 0.0 0.0 0.0 5.5
1 1 3 2 14.0 0.0 0.0 1.1 0.0
1 1 10 1 11.0 6.4 0.0 0.0 0.0
1 1 10 2 9.9 5.9 2.9 2.2 0.0
1 1 15 1 1.2 0.1 0.0 1.1 5.1
1 1 15 2 8.8 3.0 3.6 1.5 2.3

What Would We Like to Know?

  • Is the design complete?
  • Are replicates like each other?
  • How do the ratings on the different scales differ?
  • Are raters giving different scores on average?
  • Do ratings change over the weeks?

Each of these questions involves different summaries of the data.

Gathering

  • When gathering, you need to specify the keys (identifiers) and the values (measures).

Keys/Identifiers: - Identify a record (must be unique) - Example: Indices on an random variable - Fixed by design of experiment (known in advance) - May be single or composite (may have one or more variables)

Values/Measures: - Collected during the experiment (not known in advance) - Usually numeric quantities

Gathering the French Fries

ff_long <- gather(french_fries, key = variable, value = 
                    rating, potato:painty)
head(ff_long)
#>   time treatment subject rep variable rating
#> 1    1         1       3   1   potato    2.9
#> 2    1         1       3   2   potato   14.0
#> 3    1         1      10   1   potato   11.0
#> 4    1         1      10   2   potato    9.9
#> 5    1         1      15   1   potato    1.2
#> 6    1         1      15   2   potato    8.8

Long to Wide

In certain applications, we may wish to take a long dataset and convert it to a wide dataset (perhaps displaying in a table).

This is called "spreading" the data.

Spread

We use the spread function from tidyr to do this:

french_fries_wide <- spread(ff_long, key = variable, 
                            value = rating)

head(french_fries_wide)
#>   time treatment subject rep buttery grassy painty potato rancid
#> 1    1         1       3   1     0.0    0.0    5.5    2.9    0.0
#> 2    1         1       3   2     0.0    0.0    0.0   14.0    1.1
#> 3    1         1      10   1     6.4    0.0    0.0   11.0    0.0
#> 4    1         1      10   2     5.9    2.9    0.0    9.9    2.2
#> 5    1         1      15   1     0.1    0.0    5.1    1.2    1.1
#> 6    1         1      15   2     3.0    3.6    2.3    8.8    1.5

Answer some Questions

  • Easiest question to start is whether the ratings are similar on the different scales, potato'y, buttery, grassy, rancid and painty.

  • We need to gather the data into long form, and make plots facetted by the scale.

Ratings on the Different Scales

ff.m <- french_fries %>% 
  gather(type, rating, -subject, -time, -treatment, -rep)
head(ff.m)
#>   time treatment subject rep   type rating
#> 1    1         1       3   1 potato    2.9
#> 2    1         1       3   2 potato   14.0
#> 3    1         1      10   1 potato   11.0
#> 4    1         1      10   2 potato    9.9
#> 5    1         1      15   1 potato    1.2
#> 6    1         1      15   2 potato    8.8

ggplot(data=ff.m, aes(x=rating)) + geom_histogram(binwidth=2) + 
  facet_wrap(~type, ncol=5) 

Side-By-Side Boxplots

ggplot(data=ff.m, aes(x=type, y=rating, fill=type)) + 
  geom_boxplot()

Do the Replicates Look Like Each Other?

We will start to tackle this by plotting the replicates against each other using a scatterplot.

We need to gather the data into long form, and then get the replicates spread into separate columns.

Check Replicates

head(ff.m)
#>   time treatment subject rep   type rating
#> 1    1         1       3   1 potato    2.9
#> 2    1         1       3   2 potato   14.0
#> 3    1         1      10   1 potato   11.0
#> 4    1         1      10   2 potato    9.9
#> 5    1         1      15   1 potato    1.2
#> 6    1         1      15   2 potato    8.8
ff.s <- ff.m %>% spread(rep, rating)
head(ff.s)
#>   time treatment subject    type   1    2
#> 1    1         1       3 buttery 0.0  0.0
#> 2    1         1       3  grassy 0.0  0.0
#> 3    1         1       3  painty 5.5  0.0
#> 4    1         1       3  potato 2.9 14.0
#> 5    1         1       3  rancid 0.0  1.1
#> 6    1         1      10 buttery 6.4  5.9

Check Replicates

ggplot(data=ff.s, aes(x=`1`, y=`2`)) + geom_point() +
  theme(aspect.ratio=1) + xlab("Rep 1") + ylab("Rep 2")
ggplot(data=ff.s, aes(x=`1`, y=`2`)) + geom_point() +
  theme(aspect.ratio=1) + xlab("Rep 1") + ylab("Rep 2") + 
  scale_x_sqrt() + scale_y_sqrt()

Your Turn

Make the scatterplots of reps against each other separately for scales, and treatment.

Your Turn

Read in the billboard top 100 music data, which contains N'Sync and Backstreet Boys songs that entered the billboard charts in the year 2000

billboard <- read.csv("../data/billboard.csv")

What's in this data? What's X1-X76?

Your Turn

  1. Use tidyr to convert this data into a long format appropriate for plotting a time series (date on the x axis, chart position on the y axis)
  2. Use ggplot2 to create this time series plot:

dplyr Verbs

The package dplyr helps to make various summaries of the data. There are five primary dplyr verbs, representing distinct data analysis tasks:

  • Filter: Remove the rows of a data frame, producing subsets
  • Arrange: Reorder the rows of a data frame
  • Select: Select particular columns of a data frame
  • Mutate: Add new columns that are functions of existing columns
  • Summarise: Create collapsed summaries of a data frame

The Split-Apply-Combine Approach

(Diagram originally from Hadley Wickham)

Split-Apply-Combine in dplyr

french_fries_split <- group_by(ff_long, variable) # SPLIT
french_fries_apply <- summarise(french_fries_split, 
     rating = mean(rating, na.rm = TRUE)) # APPLY + COMBINE
french_fries_apply
#> Source: local data frame [5 x 2]
#> 
#>   variable rating
#>      (chr)  (dbl)
#> 1  buttery   1.82
#> 2   grassy   0.66
#> 3   painty   2.52
#> 4   potato   6.95
#> 5   rancid   3.85

Filter

french_fries %>%
    filter(subject == 3, time == 1)
#>   time treatment subject rep potato buttery grassy rancid painty
#> 1    1         1       3   1    2.9     0.0    0.0    0.0    5.5
#> 2    1         1       3   2   14.0     0.0    0.0    1.1    0.0
#> 3    1         2       3   1   13.9     0.0    0.0    3.9    0.0
#> 4    1         2       3   2   13.4     0.1    0.0    1.5    0.0
#> 5    1         3       3   1   14.1     0.0    0.0    1.1    0.0
#> 6    1         3       3   2    9.5     0.0    0.6    2.8    0.0

Arrange

french_fries %>%
    arrange(desc(rancid)) %>%
    head
#>   time treatment subject rep potato buttery grassy rancid painty
#> 1    9         2      51   1    7.3     2.3      0     15    0.1
#> 2   10         1      86   2    0.7     0.0      0     14   13.1
#> 3    5         2      63   1    4.4     0.0      0     14    0.6
#> 4    9         2      63   1    1.8     0.0      0     14   12.3
#> 5    5         2      19   2    5.5     4.7      0     13    4.6
#> 6    4         3      63   1    5.6     0.0      0     13    4.4

Select

french_fries %>%
    select(time, treatment, subject, rep, potato) %>%
    head
#>    time treatment subject rep potato
#> 61    1         1       3   1    2.9
#> 25    1         1       3   2   14.0
#> 62    1         1      10   1   11.0
#> 26    1         1      10   2    9.9
#> 63    1         1      15   1    1.2
#> 27    1         1      15   2    8.8

Summarise

french_fries %>%
    group_by(time, treatment) %>%
    summarise(mean_rancid = mean(rancid), 
              sd_rancid = sd(rancid))
#> Source: local data frame [30 x 4]
#> Groups: time [?]
#> 
#>      time treatment mean_rancid sd_rancid
#>    (fctr)    (fctr)       (dbl)     (dbl)
#> 1       1         1         2.8       3.2
#> 2       1         2         1.7       2.7
#> 3       1         3         2.6       3.2
#> 4       2         1         3.9       4.4
#> 5       2         2         2.1       3.1
#> 6       2         3         2.5       3.4
#> 7       3         1         4.7       3.9
#> 8       3         2         2.9       3.8
#> 9       3         3         3.6       3.6
#> 10      4         1         2.1       2.4
#> ..    ...       ...         ...       ...

Your Turn

  • Use summarise to compute the mean science score for each country in the OECD PISA data

String Manipulation

When the experimental design is packed into column names, we need to extract it, and tidy it up.

genes <- read_csv("../data/genes.csv")
kable(head(genes))
id WI-6.R1 WI-6.R2 WI-6.R4 WM-6.R1 WM-6.R2 WI-12.R1 WI-12.R2 WI-12.R4 WM-12.R1 WM-12.R2 WM-12.R4
Gene 1 2.2 2.20 4.2 2.63 5.1 4.5 5.5 4.4 3.9 4.2 3.73
Gene 2 1.5 0.59 1.9 0.52 2.9 1.4 3.0 1.3 1.2 1.2 0.89
Gene 3 2.0 0.87 3.3 0.53 4.6 2.2 5.6 2.5 2.5 3.0 1.35

Gather Column Names into Long Form

gather(genes, variable, expr, -id) %>% kable
id variable expr
Gene 1 WI-6.R1 2.18
Gene 2 WI-6.R1 1.46
Gene 3 WI-6.R1 2.03
Gene 1 WI-6.R2 2.20
Gene 2 WI-6.R2 0.59
Gene 3 WI-6.R2 0.87
Gene 1 WI-6.R4 4.20
Gene 2 WI-6.R4 1.86
Gene 3 WI-6.R4 3.28
Gene 1 WM-6.R1 2.63
Gene 2 WM-6.R1 0.52
Gene 3 WM-6.R1 0.53
Gene 1 WM-6.R2 5.06
Gene 2 WM-6.R2 2.88
Gene 3 WM-6.R2 4.63
Gene 1 WI-12.R1 4.54
Gene 2 WI-12.R1 1.36
Gene 3 WI-12.R1 2.18
Gene 1 WI-12.R2 5.53
Gene 2 WI-12.R2 2.96
Gene 3 WI-12.R2 5.56
Gene 1 WI-12.R4 4.41
Gene 2 WI-12.R4 1.31
Gene 3 WI-12.R4 2.54
Gene 1 WM-12.R1 3.85
Gene 2 WM-12.R1 1.18
Gene 3 WM-12.R1 2.54
Gene 1 WM-12.R2 4.18
Gene 2 WM-12.R2 1.19
Gene 3 WM-12.R2 3.04
Gene 1 WM-12.R4 3.73
Gene 2 WM-12.R4 0.89
Gene 3 WM-12.R4 1.35

Separate Columns

genes %>%
  gather(variable, expr, -id) %>%
  separate(variable, c("trt", "leftover"), "-") %>%
  kable
id trt leftover expr
Gene 1 WI 6.R1 2.18
Gene 2 WI 6.R1 1.46
Gene 3 WI 6.R1 2.03
Gene 1 WI 6.R2 2.20
Gene 2 WI 6.R2 0.59
Gene 3 WI 6.R2 0.87
Gene 1 WI 6.R4 4.20
Gene 2 WI 6.R4 1.86
Gene 3 WI 6.R4 3.28
Gene 1 WM 6.R1 2.63
Gene 2 WM 6.R1 0.52
Gene 3 WM 6.R1 0.53
Gene 1 WM 6.R2 5.06
Gene 2 WM 6.R2 2.88
Gene 3 WM 6.R2 4.63
Gene 1 WI 12.R1 4.54
Gene 2 WI 12.R1 1.36
Gene 3 WI 12.R1 2.18
Gene 1 WI 12.R2 5.53
Gene 2 WI 12.R2 2.96
Gene 3 WI 12.R2 5.56
Gene 1 WI 12.R4 4.41
Gene 2 WI 12.R4 1.31
Gene 3 WI 12.R4 2.54
Gene 1 WM 12.R1 3.85
Gene 2 WM 12.R1 1.18
Gene 3 WM 12.R1 2.54
Gene 1 WM 12.R2 4.18
Gene 2 WM 12.R2 1.19
Gene 3 WM 12.R2 3.04
Gene 1 WM 12.R4 3.73
Gene 2 WM 12.R4 0.89
Gene 3 WM 12.R4 1.35

genes %>%
  gather(variable, expr, -id) %>%
  separate(variable, c("trt", "leftover"), "-") %>%
  separate(leftover, c("time", "rep"), "\\.") %>% kable
id trt time rep expr
Gene 1 WI 6 R1 2.18
Gene 2 WI 6 R1 1.46
Gene 3 WI 6 R1 2.03
Gene 1 WI 6 R2 2.20
Gene 2 WI 6 R2 0.59
Gene 3 WI 6 R2 0.87
Gene 1 WI 6 R4 4.20
Gene 2 WI 6 R4 1.86
Gene 3 WI 6 R4 3.28
Gene 1 WM 6 R1 2.63
Gene 2 WM 6 R1 0.52
Gene 3 WM 6 R1 0.53
Gene 1 WM 6 R2 5.06
Gene 2 WM 6 R2 2.88
Gene 3 WM 6 R2 4.63
Gene 1 WI 12 R1 4.54
Gene 2 WI 12 R1 1.36
Gene 3 WI 12 R1 2.18
Gene 1 WI 12 R2 5.53
Gene 2 WI 12 R2 2.96
Gene 3 WI 12 R2 5.56
Gene 1 WI 12 R4 4.41
Gene 2 WI 12 R4 1.31
Gene 3 WI 12 R4 2.54
Gene 1 WM 12 R1 3.85
Gene 2 WM 12 R1 1.18
Gene 3 WM 12 R1 2.54
Gene 1 WM 12 R2 4.18
Gene 2 WM 12 R2 1.19
Gene 3 WM 12 R2 3.04
Gene 1 WM 12 R4 3.73
Gene 2 WM 12 R4 0.89
Gene 3 WM 12 R4 1.35

gtidy <- genes %>%
  gather(variable, expr, -id) %>%
  separate(variable, c("trt", "leftover"), "-") %>%
  separate(leftover, c("time", "rep"), "\\.") %>%
  mutate(trt = sub("W", "", trt)) %>%
  mutate(rep = sub("R", "", rep))
kable(head(gtidy))
id trt time rep expr
Gene 1 I 6 1 2.18
Gene 2 I 6 1 1.46
Gene 3 I 6 1 2.03
Gene 1 I 6 2 2.20
Gene 2 I 6 2 0.59
Gene 3 I 6 2 0.87

Your Turn

  1. Using the tidied dataset (gtidy), find the mean expression for each combination of id, trt, and time.
  2. Use this tidied data to make this plot.

Re-structuring the Temperature Data

melbtemp.m <- melbtemp %>%
  select(num_range("V", c(1,2,3,4,seq(5,128,4)))) %>%
  filter(V4 %in% c("PRCP", "TMAX", "TMIN")) %>%
  gather(day, value, V5:V125, na.rm = TRUE) %>%
  spread(V4, value) %>%
  mutate(
    tmin = as.numeric(TMIN) / 10,
    tmax = as.numeric(TMAX) / 10,
    t_range = tmax - tmin,
    prcp = as.numeric(PRCP) / 10
  ) %>%
  rename(stn=V1, year=V2, month=V3)

kable(head(melbtemp.m))
stn year month day PRCP TMAX TMIN tmin tmax t_range prcp
ASN00086282 1970 7 V101 0 158 78 7.8 16 8.0 0.0
ASN00086282 1970 7 V105 13 149 36 3.6 15 11.3 1.3
ASN00086282 1970 7 V109 3 133 61 6.1 13 7.2 0.3
ASN00086282 1970 7 V113 0 143 46 4.6 14 9.7 0.0
ASN00086282 1970 7 V117 25 150 42 4.2 15 10.8 2.5
ASN00086282 1970 7 V121 0 145 63 6.3 14 8.2 0.0

melbtemp.m$day <- factor(melbtemp.m$day, 
  levels=c("V5","V9","V13","V17","V21","V25","V29",
           "V33","V37","V41","V45","V49","V53","V57",
           "V61","V65","V69","V73","V77","V81","V85",
           "V89","V93","V97","V101","V105","V109",
           "V113","V117","V121","V125"),
  labels=1:31)
melbtemp.m$date <- as.Date(paste(melbtemp.m$day, 
     melbtemp.m$month, melbtemp.m$year, sep="-"),
     "%d-%m-%Y")

kable(head(melbtemp.m))
stn year month day PRCP TMAX TMIN tmin tmax t_range prcp date
ASN00086282 1970 7 25 0 158 78 7.8 16 8.0 0.0 1970-07-25
ASN00086282 1970 7 26 13 149 36 3.6 15 11.3 1.3 1970-07-26
ASN00086282 1970 7 27 3 133 61 6.1 13 7.2 0.3 1970-07-27
ASN00086282 1970 7 28 0 143 46 4.6 14 9.7 0.0 1970-07-28
ASN00086282 1970 7 29 25 150 42 4.2 15 10.8 2.5 1970-07-29
ASN00086282 1970 7 30 0 145 63 6.3 14 8.2 0.0 1970-07-30

Re-structuring Tuberculosis Data

tb_tidy <- tb %>%
  gather(demographic, cases, m_04:f_u, na.rm = TRUE) %>%
  separate(demographic, c("sex", "age"), "_") %>%
  rename(country = iso2) %>%
  arrange(country, year, sex, age) 
kable(head(tb_tidy))
country year sex age cases
AD 1996 f 014 0
AD 1996 f 1524 1
AD 1996 f 2534 1
AD 1996 f 3544 0
AD 1996 f 4554 0
AD 1996 f 5564 1

Dates and Times

Dates are deceptively hard to work with.

Example: 02/05/2012. Is it February 5th, or May 2nd?

Other things are difficult too:

  • Time zones
  • POSIXct format in base R is challenging

The lubridate, and timeDate package helps tackle some of these issues.

Basic Lubridate Use

now()
#> [1] "2016-05-22 19:13:44 AEST"
today()
#> [1] "2016-05-22"
now() + hours(4)
#> [1] "2016-05-22 23:13:44 AEST"
today() - days(2)
#> [1] "2016-05-20"

Parsing Dates

ymd("2013-05-14")
#> [1] "2013-05-14"
mdy("05/14/2013")
#> [1] "2013-05-14"
dmy("14052013")
#> [1] "2013-05-14"
ymd_hms("2013:05:14 14:5:30", tz = "Australia/Melbourne")
#> [1] "2013-05-14 14:05:30 AEST"

Extracting Temporal Elements

month(ymd("2013-05-14"))
#> [1] 5
year(ymd("2013-05-14"))
#> [1] 2013
wday(ymd("2013-05-14"), label=TRUE, abbr=TRUE)
#> [1] Tues
#> Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat
isWeekday(ymd("2013-05-14"))
#> 2013-05-14 
#>       TRUE

Your Turn

For the pedestrian sensor data, extract

  • year
  • month
  • day of the week

from the date variable

Using the Package ggplot2

Elements of a plot

  • data
  • aesthetics: mapping of variables to graphical elements
  • geom: type of plot structure to use
  • transformations: log scale, …

Additional components

  • layers: multiple geoms, multiple data sets, annotation
  • facets: show subsets in different plots
  • themes: modifying style

RStudio's cheatsheet gives a nice, concise overview of the plotting capabilities.

Data - Currency Cross Rates

Extracted from http://openexchangerates.org, extracted using the json api, with the R package, jsonlite.

rates <- read_csv("../data/rates.csv")
rates[1:5,1:8]
#> Source: local data frame [5 x 8]
#> 
#>         date   AED   AFN   ALL   AMD   ANG   AOA   ARS
#>       (date) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
#> 1 2015-02-23   3.7    57   124   479   1.8   106   8.7
#> 2 2015-02-24   3.7    57   124   479   1.8   106   8.7
#> 3 2015-02-25   3.7    57   124   479   1.8   106   8.7
#> 4 2015-02-26   3.7    58   125   480   1.8   106   8.7
#> 5 2015-02-27   3.7    57   125   479   1.8   106   8.7

If you'd like to collect exchange rates yourself, see here.

Plotting Points

ggplot(data=rates, aes(x=date, y=AUD)) + geom_point()

Data Structure

  • Plots are constructed by mapping elements of data to graphical attributes.
  • Having data in a tidy structure make mapping clearer
  • Some ways of making mappings make it easier for the reader to perceive structure better

Plot Structure

  • data: rates
  • aesthetics: x=date, y=AUD
  • geom: point, line

Plotting Lines

ggplot(data=rates, aes(x=date, y=AUD)) + geom_line()

Points and Lines

ggplot(data=rates, aes(x=date, y=AUD)) + 
  geom_line() + geom_point()

Multiple Currencies

ggplot(data=rates, aes(x=date, y=AUD)) + geom_line() +
  geom_line(aes(y=NZD), colour="blue") + 
  geom_line(aes(y=GBP), colour="red")

Hmmm…

  • That code is clunky!
  • Better to rearrange data, and then let ggplot2 handle the colors, legends, …

Better Way

rates.sub <- select(rates, date, AUD, NZD, GBP)
rates.sub.m <- gather(rates.sub, currency, rate, -date)
ggplot(data=rates.sub.m, aes(x=date, y=rate, colour=currency)) + 
  geom_line() 

GRAMMAR

  • The grammar of graphics makes the mapping of a data variable to a plot element explicit.
  • This is a huge advance in data visualisation
  • This provides a closer connection between data, plots and models.

Mappings

  • Date is mapped to position along the x axis
  • Rate is mapped to position along the y axis
  • Currency is mapped to colour

Information Communication

ggplot(data=rates.sub.m, aes(x=date, y=rate, colour=currency)) + 
  geom_line() 

  • What can you read from this plot? What is the main observation?
  • The cross-rates for AUD and NZD with the USD are similar, $1USD can buy approximately $1.30 of both, but the GBP is lower, and $1USD only buys 2/3 of a GBP. Do we need a plot to know this?

Communicate Trend: Scale Currencies

rates.sub <- mutate(rates.sub, AUD=scale(AUD), NZD=scale(NZD), 
                    GBP=scale(GBP))
rates.sub$date <- as.Date(rates.sub$date)
rates.sub.m <- gather(rates.sub, currency, rate, -date)

ggplot(data=rates.sub.m, aes(x=date, y=rate, colour=currency)) +
         geom_line()

  • Now you can read off the trend: the AUD and NZD trend similarly in this time period, but the GBP is different. The GBP goes down in cross-rate, as the AUD/NZD go up.

Your Turn

In the plot below, how are variables mapped to plot elements?

Adding Marginal Rug Plot

ggplot(data=rates.sub, aes(x=AUD, y=NZD)) + 
  geom_point(alpha=0.2) + geom_rug(colour="red", alpha=0.3) + 
  theme(aspect.ratio=1)

Other Types of Plots

  • bar charts, pie charts
  • boxplots, violins,
  • histograms
  • density plots
  • dotplots

Look up ?geom_histogram and choose the index for the ggplot2 package. Look at the geom_ options. There are many! We will only cover the few main ones.

Type of Variables Suggests Mapping

  1. The values of quantitative variables should be mapped to position along a line, e.g. histogram, scatterplot. Mapping them to colour will yield only rough return of information to the reader.
  2. Categorical variables could be mapped to
  • colour, if there are few categories,
  • aggregated and mapped to position along the line,
  • mapped to angle, if all categories are available.
  1. Order is important, and if no natural order available then impose one e.g. using count

Categorical Variables - Barchart

The social variables of the PISA data include internet usage. This is a subset.

internet <- read_csv("../data/internet.csv")
dim(internet)
#> [1] 37904    20
colnames(internet)
#>  [1] "name"                         "SCHOOLID"                    
#>  [3] "Gender"                       "One player games"            
#>  [5] "Collaborative games"          "Use email"                   
#>  [7] "Chat on line"                 "Social networks"             
#>  [9] "Browse the Internet for fun"  "Read news"                   
#> [11] "Obtain practical information" "Download music"              
#> [13] "Upload content"               "Internet for school"         
#> [15] "Email students"               "Email teachers"              
#> [17] "Download from School"         "Announcements"               
#> [19] "Homework"                     "Share school material"

Categorical Variables - Barchart

ggplot(data=internet, aes(x=`Social networks`)) + 
  geom_bar(binwidth=0.5) 

Categorical Variables - Barchart

Simpson's paradox may be in play when there are multiple categorical variables. Need to divide it into basic elements.

ggplot(data=internet, aes(x=`Social networks`)) + 
  geom_bar(binwidth=0.5) +
  facet_grid(Gender~name)

Categorical Variables - Stacked Barchart

ggplot(data=internet, aes(x=`Social networks`, fill=Gender)) + 
  geom_bar(binwidth=0.5) +
  facet_wrap(~name, ncol=5) + theme(legend.position="bottom")

Categorical Variables - Dodged Bars

ggplot(data=internet) + 
  geom_bar(aes(x=`Social networks`, fill=Gender),
          position="dodge") +
  facet_wrap(~name, ncol=5) + 
  theme(legend.position="bottom")

Categorical Variables - Piechart

ggplot(data=internet, aes(x=factor(1), fill=factor(`Social networks`))) + 
  geom_bar(width = 1) + scale_x_discrete("") +
  scale_y_continuous("") +
  scale_fill_hue("Social Network Use") +
  coord_polar(theta = "y")

Yes, its deliberately made hard to do !

Quantitative and Categorical - Boxplots

Data are measurements from the National Research Council in the USA, evaluating graduate programs in Statistics.

grad <- read_csv("../data/graduate-programs.csv")
dim(grad)
#> [1] 412  16
colnames(grad)
#>  [1] "subject"            "Inst"               "AvNumPubs"         
#>  [4] "AvNumCits"          "PctFacGrants"       "PctCompletion"     
#>  [7] "MedianTimetoDegree" "PctMinorityFac"     "PctFemaleFac"      
#> [10] "PctFemaleStud"      "PctIntlStud"        "AvNumPhDs"         
#> [13] "AvGREs"             "TotFac"             "PctAsstProf"       
#> [16] "NumStud"

ggplot(data=grad, aes(x=subject, y=AvGREs)) + 
  geom_boxplot()

ggplot(data=grad, aes(x=subject, y=AvGREs)) + 
  geom_violin()

Your Turn

  • Create a side-by-side boxplot of average number of publications by program
  • Then answer, "how do the four programs compare in terms of average number of publications?"

Cognitive Principles

  • Hierarchy of mappings: (first) position along an axis - (last) color (Cleveland, 1984; Heer and Bostock, 2009)
  • Pre-attentive: Some elements like color are noticed before you even realise it. Other elements like axes are to look up information later.
  • Color palettes: qualitative, sequential, diverging. The type of variable determines the appropriate palette.
  • Color blindness: you can proof your plots with te dichromat package.
  • Proximity: To compare elements, place them close together.
  • Change blindness: When focus is interrupted differences may not be noticed, can occur when you are reading across multiple plots.

Hierarchy of Mappings

  1. Position - common scale (BEST)
  2. Position - nonaligned scale
  3. Length, direction, angle
  4. Area
  5. Volume, curvature
  6. Shading, color (WORST)

Pre-attentive

Can you find the odd one out?

Is it easier now?

Color Palettes

  • Qualitative: categorical variables
  • Sequential: low to high numeric values
  • Diverging: negative to positive values

Scales

ggplot(data=internet, aes(x=`Social networks`, fill=Gender)) + 
  geom_bar(position="dodge") +
  scale_fill_manual(values=c("Female"="orange", "Male"="darkgreen")) + 
  facet_wrap(~name, ncol=5) + 
  theme(legend.position="bottom")

Scales

ggplot(data=grad, aes(x=subject, y=AvGREs)) + 
  geom_boxplot() + scale_y_log10()

Axes

The date time axis is a little trickier to re-organise, but it can be done.

rates.sub.m$date <- as.POSIXct(rates.sub.m$date)
ggplot(data=rates.sub.m, aes(x=date, y=rate, 
  colour=currency)) + geom_line() + 
  scale_x_datetime(breaks = date_breaks("1 month"), 
                   labels = date_format("%b"))

ggplot(data=rates.sub.m, aes(x=date, y=rate, 
        colour=currency)) + geom_line() +
  xlab("Date") + ylab("Standardized rates") + 
  ggtitle("Cross rates 23/2/2015-11/11/2015")

Equations in Labels

ggplot(data=rates.sub.m, aes(x=date, y=rate, colour=currency)) +
  geom_line() +
  xlab(expression(Date[i]^2~ mu ~ pi * sigma)) + 
  ylab("Standardized rates") + 
  ggtitle("Cross rates 23/2/2015-11/11/2015")

Save the plot to an R object,p

p <- ggplot(data=rates.sub.m, aes(x=date, y=rate, 
  colour=currency)) + geom_line() + 
  scale_x_datetime(breaks = date_breaks("1 month"), 
                   labels = date_format("%b"))

Legend Position

p + theme(legend.position = "bottom")

Themes

p + theme_tufte()

p + theme_economist()

Color Palettes

p + scale_color_brewer("", palette = "Dark2")

Color Blind-proofing

clrs <- hue_pal()(3)
p + scale_color_manual("", values=clrs) + 
  theme(legend.position = "none")

clrs <- dichromat(hue_pal()(3))
p + scale_color_manual("", values=clrs) + 
  theme(legend.position = "none")

clrs <- brewer.pal(3, "Dark2")
p + scale_color_manual("", values=clrs) + 
  theme(legend.position = "none")

clrs <- dichromat(brewer.pal(3, "Dark2"))
p + scale_color_manual("", values=clrs) + 
  theme(legend.position = "none")

Your Turn

Proximity - From with plot can you answer: Is the proportion of girls who use social networks every day (4) higher than boys, in Australia? And is this different in Germany?

Your Turn

  • Brainstorm with your neighbour ways to rearrange this plot to answer the question.

  • Then tackle this question: Are German girls more likely to report using social networks once or twice per month (1) than Japanese girls?

  • What ways would you re-arrange the plot to tackle this one?

Proximity

  • It is ok to make more than one plot.
  • Actually it is recommended.

Next

Advanced graphics

Credits